AITopics | experimental setup

Scaling Law with Learning Rate Annealing

Neural Information Processing Systems | Jun-19-2026, 04:46:34 GMT

We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps: $L(s) = L_0 + A \cdot S_1^{-\alpha} - C \cdot S_2$, where $L(s)$ is the validation loss at step $s$, $S_1$ is the area under the LR curve, $S_2$ is the LR annealing area, and $L_0$, $A$, $C$, $\alpha$ are constant parameters.
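
As a rough illustration of how the law above could be evaluated, the following sketch computes the loss from the two areas. It assumes that the area under the LR curve can be approximated by the sum of per-step learning rates. The excerpt does not define the annealing area precisely, so it is passed in as a plain number. The function names and all constants are hypothetical and are not fitted values from the paper.

```python
def forward_area(lrs):
    """Approximate S1 as the discrete area under the LR curve.

    Assumption: one learning-rate value per training step, summed up.
    """
    return sum(lrs)


def predicted_loss(s1, s2, l0, a, c, alpha):
    """Evaluate L(s) = L0 + A * S1^(-alpha) - C * S2.

    s2 (the LR annealing area) is taken as an input because its exact
    definition is not given in the excerpt.
    """
    return l0 + a * s1 ** (-alpha) - c * s2


if __name__ == "__main__":
    # Hypothetical constant-LR schedule; with no annealing we assume S2 = 0.
    lrs = [2e-4] * 1000
    s1 = forward_area(lrs)
    # Hypothetical constants, chosen only to make the example run.
    print(predicted_loss(s1, 0.0, l0=2.7, a=0.4, c=600.0, alpha=0.5))
```

For a fixed forward area and positive C, a larger annealing area gives a lower predicted loss, which is the effect the minus sign in the formula encodes.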

large language model, machine learning, natural language, (21 more...)

Country: North America > United States (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Workflow (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

d1881b5125b4e9cf42f6d6d0b6575934-Supplemental-Conference.pdf

Neural Information Processing Systems | Apr-29-2026, 20:50:19 GMT

artificial intelligence, machine learning, matrix, (17 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.71)

What Knowledge Gets Distilled in Knowledge Distillation?

Utkarsh Ojha, Yuheng Li, Anirudh Sundara Rajan, Yingyu Liang, Yong Jae Lee (University of Wisconsin-Madison)

Neural Information Processing Systems | Apr-25-2026, 22:47:11 GMT

Knowledge distillation aims to transfer useful information from a teacher network to a student network, with the primary goal of improving the student's performance for the task at hand. Over the years, there has been a deluge of novel techniques and use cases of knowledge distillation. Yet, despite the various improvements, there seems to be a glaring gap in the community's fundamental understanding of the process. Specifically, what is the knowledge that gets distilled in knowledge distillation? In other words, in what ways does the student become similar to the teacher?

artificial intelligence, machine learning, student, (16 more...)

Country: North America > United States > Wisconsin > Dane County > Madison (0.40)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.30)

Experimental Setup

Neural Information Processing Systems | Apr-24-2026, 11:07:34 GMT

We provide an extended version of the Experimental Setup from Section 5 below. Linear Model: This domain involves learning a linear model when the underlying mapping between features and predictions is cubic. Concretely, the aim is to choose the top $B = 1$ out of $N = 50$ resources using a linear model. The fact that the features can be seen as 1-dimensional allows us to visualize the learned models (as seen in Figure 4). Predict: Given a feature $x_n \sim U[0, 1]$, use a linear model to predict the utility $\hat{y}_n$ of choosing resource $n$, where the true utility is given by $y_n = 10x_n^3 - 6.5x_n$.
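
The predict-then-choose setup described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code. It assumes the true utility is the cubic 10x^3 - 6.5x (the minus sign is inferred from the garbled formula) and that the linear model is fitted by ordinary least squares on the same instance, which is a simplification of the training procedure. All function names are hypothetical.

```python
import random


def true_utility(x):
    # Assumed form of the cubic ground truth: y = 10x^3 - 6.5x.
    return 10 * x ** 3 - 6.5 * x


def fit_linear(xs, ys):
    """Closed-form least-squares fit of y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def top_b(scores, b=1):
    """Indices of the b highest-scoring resources."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:b]


def run_instance(seed=0, n=50, b=1):
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]  # x_n ~ U[0, 1]
    ys = [true_utility(x) for x in xs]
    slope, intercept = fit_linear(xs, ys)
    preds = [slope * x + intercept for x in xs]
    chosen = top_b(preds, b)  # decision made from predicted utilities
    best = top_b(ys, b)       # decision made with perfect knowledge
    achieved = sum(ys[i] for i in chosen)
    optimal = sum(ys[i] for i in best)
    return {"chosen": chosen, "achieved": achieved, "optimal": optimal}


if __name__ == "__main__":
    print(run_instance(seed=0))
```

The gap between `optimal` and `achieved` measures decision quality directly, as opposed to the prediction error of the linear fit.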

artificial intelligence, machine learning, prediction, (17 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.31)

Appendix Table of Contents

Neural Information Processing Systems | Feb-17-2026, 06:28:53 GMT

There are several key limitations of the MADE algorithm: 1. As mentioned in Section 3.1, the MADE algorithm can only mask neural networks such that they respect the autoregressive property. [...] The non-deterministic MADE masking algorithm presented in Germain et al. [2015], the resulting [...] Proposition 1 formalizes this point. [...] In Section 3.1, we showed that finding the weight masks for each neural network layer is equivalent [...] Figure 7 provides a visual example of the steps performed by Algorithm 1. [...] 's last row, we need the products of the last row of [...] Randomly generated adjacency structures of 15 dimensions. [...] IP gives better objective values when the adjacency matrix is very sparse.

artificial intelligence, machine learning, matrix, (18 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.96)

Supplementary Material for "Hierarchical Adaptive Value Estimation for Multi-modal Visual Reinforcement Learning"

Neural Information Processing Systems | Feb-15-2026, 22:13:27 GMT

Section C describes the details of the experimental setup, including network architectures, hyperparameters, and hardware details. This outcome emphasizes the necessity of feature interaction or feature fusion to tackle intricate situations. Furthermore, an amalgamation of feature fusion and value fusion can offer better performance. This adjustment allows us to evaluate the robustness and adaptability of our approach in handling a larger number of vehicles in the environment. As we increase the number of vehicles on the road, Fig. A2 (a) clearly indicates that HAVE consistently delivers the highest performance. The training and testing curves of HAVE and other comparable methods are given in A4.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

Industry: Transportation > Ground > Road (0.34)

Technology: